In this paper we introduce an iterative voting algorithm and 
then use it to obtain a rating method which is very ro- 
bust against collusion attacks as well as random and biased 
raters. Unlike the previous iterative methods, our method is 
not based on comparing submitted evaluations to an approx- 
imation of the final rating scores, and it entirely decouples 
credibility assessment of the cast evaluations from the rank- 
ing itself. The convergence of our algorithm relies on the 
existence of a fixed point of a continuous mapping which is 
also a stationary point of a constrained optimization objec- 
tive. We have implemented and tested our rating method 
using both simulated data as well as real world data. In 
particular, we have applied our method to movie evalua- 
tions obtained from MovieLens and compared our results 
with IMDb and Rotten Tomatoes movie rating sites. Not 
only are the ratings provided by our system very close to 
IMDb rating scores, but when we differ from the IMDb rat- 
ings, the direction of such differences is essentially always 
towards the ratings provided by the critics in Rotten Toma- 
toes. Our tests demonstrate high efficiency of our method, 
especially for very large online rating systems, for which 
trust management is both of the highest importance and 
one of the most challenging problems. 

1. INTRODUCTION 

Human computation is a new model of distributed computing [4, 42] in which the computational power of machines 
is augmented by the cognitive power of human beings. The 
main benefit of such a model comes from the fact that many 
problems which are trivial to humans are still intractable for 
machines. Human computation has been employed to solve 
a wide variety of problems such as spam detection [38], question answering [31], tagging photos [35], as well as many 
others [10]. 

E-Commerce is another area in which human computation 
has been widely used to assess quality of products as well 
as trustworthiness of people. Continuous growth of online 



commerce as well as of many other forms of online interac- 
tions, crucially depends on trust management. For exam- 
ple, in online markets, due to a huge number of potential 
sellers with varying reputation as well as a huge number of 
available products, sometimes of dubious quality, buyers rely 
on the feedback of other customers who have shared their 
experiences with the community to help them make their 
decisions. One common way of sharing such information 
and experiences is through Online Rating Systems. In such 
systems, providers (either manufacturers or just sellers) ad- 
vertise their products and customers evaluate them based on 
their experience of dealing with that particular product or 
vendor. Based on such individual evaluations received from 
customers, the system determines a rating score for every 
product, reflecting the overall quality of the product from 
customers' point of view. Yelp [40], IMDb [15] and Amazon [5] are some of the popular online systems with rating 
facilities. 

An important issue in these systems is the trustworthiness 
of the cast feedback. Many pieces of evidence [12, 16] show 
that users may try to manipulate the ratings of products by 
casting unfair evaluations. Unfair evaluations are evalua- 
tions which are cast regardless of the quality of the product 
and usually are given based on personal vested interests of 
the users. For example, providers may try to submit sup- 
porting feedback to increase the rating of their products and 
consequently increase their revenue [12]. Providers may also 
attack their competitors by giving low scores in their 
feedback on their competitors' products. Another study 
shows that some sellers on eBay boost their reputation 
unfairly by buying or selling feedback [16]. 

Unfair evaluations are broadly divided into two categories 
[34, 39]: (i) individual and (ii) collaborative. Unlike 
individual unfair evaluations, collaborative unfair evaluations, 
also called collusion [32, 34], are cast by a group of users who 
try to manipulate the rating of a product collaboratively. 
Collusion, by its nature, is more sophisticated and harder 
to deal with than unfair evaluations cast individually [32]. 

Although detection of unfair evaluations, mainly collusion, 
has been widely studied in the recent literature 
[23, 28, 32, 34, 39], some serious challenges remain. Generally, 
existing collusion detection techniques rely either on using 
temporal and behavioral indicators to identify colluding 
groups [23, 28] or on continuous monitoring of the behavior 
of the system, looking for unusual behavior [38, 39]. 

Identifying colluding groups usually involves looking for 
cliques; however, relying on solving or approximating such 
NP-hard problems clearly degrades the performance of these 
[Figure 1: Rating Through Voting - Method Overview. (a) Rating a product; (b) Corresponding voting scheme.]



systems. Heuristic approaches, such as the Frequent Itemset 
Mining technique [1] used in [28], do not solve the problem 
either, without impacting the accuracy of the system. Tun- 
ing indicators or monitoring systems in order to make them 
sufficiently sensitive to collusion attacks, without an exces- 
sive number of false alarms, can be a daunting task, usually 
relying on machine learning techniques. However, preparing 
an adequate training dataset for such systems is yet another 
serious challenge [28] . 

Moreover, existing collusion detection systems generally 
use local quality metrics, i.e., metrics which are calculated 
from a small subset of data, and are thus vulnerable to ma- 
nipulation. The more global the metric is, the more ef- 
fort is needed to manipulate it. In other words, in order 
to manipulate a quality metric which is calculated based on 
the behavior of the entire community, one must change the 
sentiment of the community; for online communities with 
millions of users this might be an intractable task. For ex- 
ample, the dependence of the PageRank of any particular 
web page on the PageRanks of virtually all other webpages 
on the Internet is what makes the PageRank robust against 
manipulation [29] . 

In this paper we propose a method called 'Rating-through-Voting' 
to help address the above problems. Our method 
first reduces the problem of rating of a product to an "election". 
In such an election, users' feedback is seen as a vote 
on the most appropriate rating level for the product from 
a list of a few (usually at most ten) available levels. We 
propose an iterative voting algorithm which for each level in 
each election list calculates how credible it is to be accepted 
as the community sentiment about the product. Using such 
calculated credibility degrees, we assign each voter a trust- 
worthiness degree reflecting to what extent she has behaved 
in accordance with the community sentiment. In the next 
round of iteration we employ these trustworthiness ranks to 
recalculate such credibility degrees for each level. We iter- 
ate this process until it converges, i.e., until new calculated 
credibility degrees are sufficiently close to the previous ones. 
The convergence of the algorithm is guaranteed by the 
existence of a fixed point of a continuous mapping, which also 
happens to be a stationary point of a constrained optimization 
objective function. 

Such credibility degrees of rating levels are then aggregated 
into rating scores of products in a fully decoupled 
way, allowing the system to choose the aggregation method best 
suited to the intended application. We have tested our 



method both on simulated data involving very large col- 
lusion attacks as well as on real world movie rating data. 

The remainder of the paper is organized as follows. A 
general overview of our method along with some basic 
definitions is in Section 2. In Section 3 we describe the details of 
our voting algorithm and its use for rating ('Rating-through-Voting'). 
In Section 4 we present an application scenario 
with its implementation details; in Section 5 we discuss the 
evaluation results of our method. Section 6 is devoted to 
some of the related work; our conclusions are presented in 
Section 7. 



2. PRELIMINARIES 
2.1 Definitions 

Since some of our terminology, such as "a voter" or "an 
election", is also used either in everyday English or other 
technical fields, we now specify the intended meaning of such 
terms as we use them in this paper. 

An evaluation is feedback given by a person on a product. 
We call the person who casts the evaluation a rater or a 
voter. In order to avoid overusing a term, we may use the words 
evaluation, vote and feedback synonymously. 

An election is the process of choosing one item from a finite 
list of items. When we speak about rating through voting, 
the election items on a voting list are the possible rating 
levels for the quality of a product; for example, 1-10 for 
rating movies in IMDb or 1-5 to represent the quality of a 
product in the Amazon online market. 

Trustworthiness of a voter is a metric which shows to 
what extent the voter has been voting in accordance with 
the sentiment of the community. 

Credibility of an item on an election list is a metric which 
shows the level of community approval of that item. 

Rating scores are the scores which are produced by our 
system, or by other systems which we use for comparison 
purposes, to reflect the quality of a product. We may use the words 
'rating scores', 'scores' or 'ranks' interchangeably to refer to 
such scores. 

Of course, the precise meaning of the above terms will 
become apparent only after we present their technical usage. 
2.2 Method Overview 

Our method is based on the idea of reducing rating to 
voting. In rating systems we have a list of products to be 
rated by a group of users. Every user can express her opinion 
on the quality of a product by casting an evaluation at 
the quality level which best describes her opinion. 
For example, in Figure 1(a) three users have evaluated 
a product by casting their evaluations. 

In most rating systems the assigned rating score is a (pos- 
sibly weighted) average of the cast votes. However, we can 
also look at rating process as an election. In such an elec- 
tion, the voters are the people who are evaluating the prod- 
uct; the candidates are all possible values which a voter can 
choose to express her opinion about the product; for exam- 
ple, 1 star to indicate a low quality or 5 stars to indicate a 
high quality. We will use a technical term 'an item' to refer 
to such candidates. Our algorithm aims to determine the 
winner of such an election in a robust way while reflecting 
the prevailing opinion of the community on the quality of 
the product. Figure 1(b) depicts such an election. 

Therefore, for each product in the system, we generate 
one election and all people who have rated that product are 
considered as having voted in the corresponding election. In 
a 'classic election', the candidate who has received the ma- 
jority of the cast votes will be the winner of the election. 
However, to make our elections robust against collusion at- 
tacks, we evaluate voters based on how their choices are 
supported by other community members. The closer the 
votes of a voter are to some form of a 'community consensus', 
the higher her trust rank is. In aggregating the votes, 
rather than considering them all equal, each voter's vote 
has a value equal to the voter's trust rank. The trust ranks 
of voters and credibility levels of the candidates obtained in this way 
can now be used to compute the rating scores of products. 
We do not presuppose any particular method for computing 
such rating scores; such computation can be done by freely 
choosing a method best suited for the specific application. 
In our tests we have used one of many such possible options, 
see Section 3.4. 



3. RATING-THROUGH-VOTING 
3.1 Basic Concepts and Notation 

Setup: We will assume that a set of $N$ voters $V_1, \ldots, V_N$ 
are given a collection of $L$ lists $\Lambda_1, \ldots, \Lambda_L$; each list $\Lambda_l$ 
contains $n_l$ many items, $\Lambda_l = \{I^l_1, \ldots, I^l_{n_l}\}$, and the voters are 
asked to choose the "best" item on each list. Not every voter 
is obliged to vote for the best item on every list, but can 
choose to vote on a subset of these lists. 

Problem: As the system receives these votes, the task is to 
assess: 

1. trustworthiness of all voters; 

2. level of "community approval" for each item. 

In order to make such estimate of the level of "community 
approval" robust against possible unfair voting practices of 
the participants, the assessment of the trustworthiness of 
voters should have the following features (at the moment 
we allow these features to be specified both very vaguely 
and in a somewhat circular way). 



1. voters who vote on a large number of lists, and whose 
choices are largely in agreement with the prevailing 
sentiment of the community of voters, should obtain 
higher level of trustworthiness than the voters who 
vote on only a few of these lists, or vote inconsistently 
with the prevailing sentiment; 

2. voters who seldom vote or voters who vote more or less 
randomly, just to be seen as active in the community, 
and then choose to vote unfairly on a few particular 
lists for their choices which are not favored by the re- 
maining voters, should not be able to secure election 
of their choices even if such colluding voters are a large 
majority for those particular lists. 

Note that it is not necessary that the voters vote simul- 
taneously for all lists; in fact, the outcomes of voting on all 
but one list might have already been determined. We want 
to decide the outcome for the latest election using the past 
voting pattern of the voters (but NOT the outcomes of the 
past voting). 

3.2 Vote Aggregation Algorithm 

Let us first introduce some notation: 

• $r \to li$ denotes the fact that voter $V_r$ has participated 
in voting for the best object on list $\Lambda_l$ and has chosen 
item $I^l_i$; 

• $n_l$ denotes the number of items on list $\Lambda_l$; 

• for each item $I^l_i$ on list $\Lambda_l$ we will keep track of its 
level of community approval at the step of iteration $p$, 
denoted by $\rho^{(p)}_{li}$; 

• these individual community approval levels $\rho^{(p)}_{li}$ will be 
collected into a single vector 

$$\rho^{(p)} = (\rho^{(p)}_{li} : 1 \le l \le L,\ 1 \le i \le n_l);$$

thus, if we let $M = \sum_{1 \le l \le L} n_l$, then $\rho^{(p)} \in \mathbb{R}^M$; 

• we define $(\rho)_l$ to be the projection $(\rho_{li} : 1 \le i \le n_l)$ 
of $\rho$ to the subspace corresponding to a single list $\Lambda_l$; 

• for each voter $V_r$ we will also keep track of his 
trustworthiness $T^{(p)}_r$ at the stage of iteration $p$; 

• for each $p \ge 1$ we denote by $\|x\|_p$ the usual $p$-norm of 
the vector $x = (x_1, \ldots, x_n)$, i.e., 

$$\|x\|_p = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}.$$


Algorithm 1 shows our voting algorithm. We now explain 
the intuitive motivation behind it. We start with the usual 
vote count, where every voter has one vote, worth the same 
as the vote of every other voter. For each item $I^l_i \in \Lambda_l$ on a list $\Lambda_l$, the 
initial rank $\rho^{(0)}_{li}$ of $I^l_i$ is simply the number of votes which 
this item has received, normalized so that $\sum_i (\rho^{(0)}_{li})^2 = 1$. In the 
next round of iterative aggregation of the votes, each voter 
first gets its trustworthiness rank, which is the sum total of 
the ranks of all the items which he has voted for, "prorated" 
by a monotonically increasing function $f(x) = x^\alpha$; we will 
later explain such choice of $f(x)$. 
Algorithm 1 Adaptive Voting Algorithm 
Initialization: Let $\varepsilon > 0$ be the precision threshold, $\alpha \ge 1$ 
a discrimination setting parameter and 
will be obvious from the convergence proof below, the choice 
$f(x) = x^\alpha$ has a natural motivation, defining a commonly 
used norm on the vector space of trustworthiness ranks. We 
now rigorously prove that our algorithm converges and 
characterize the values $\rho$ which it produces. 

3.3 Convergence proof 

For a given 'community approval' ranking $\rho$ of items, 
referred to in the sequel simply as ranks of items, let us define 
the corresponding trustworthiness $T_r(\rho)$ of a voter $V_r$ as 

$$T_r(\rho) = \sum_{q,k \,:\, r \to qk} \rho_{qk} \tag{3}$$

and denote by $T(\rho)$ the vector of such trustworthiness ranks, 
$T(\rho) = (T_1(\rho), T_2(\rho), \ldots, T_N(\rho))$; then 

$$\|T\|_{\alpha+1} = \Big(\sum_r T_r^{\alpha+1}(\rho)\Big)^{\frac{1}{\alpha+1}}$$

is the $(\alpha+1)$-norm of the vector $T(\rho) \in \mathbb{R}^N$. 

We now wish to assign the ranks to the items ranked so 
that: 

1. for every list $\Lambda_l$, the vector of ranks of all items on 
that list is a unit vector, i.e., $\|(\rho)_l\|_2 = 1$; 



The idea is that now voters themselves can be judged by 
the selections they have made; a high trustworthiness rank 
will be given only to voters who have often chosen candi- 
dates favored by many other members of the community, 
thus voting in accordance with the prevailing community 
sentiment. Such voters can be considered as "reliable voters", 
choosing candidates in accordance with the community 
sentiment. On the other hand, those who badly judge oth- 
ers will themselves receive a low trustworthiness rank. We 
could not help mentioning that this idea is remarkably old: 
"Judge not, and you shall not be judged ..." as well as 
"Judge not, that you be not judged" (The New Testament, 
Luke 6:37 and Matthew 7:1, respectively). 

Now we recalculate the ranks of items using the rank update step of Algorithm 1; thus each 
received vote is now worth the present trustworthiness rank 
of the voter giving such vote. We continue such iterations 
until the ranks stop changing significantly, i.e., we stop when 
$\|\rho^{(p+1)} - \rho^{(p)}\|_2 < \varepsilon$, where $\varepsilon$ is a threshold corresponding to 
a desired precision adequate for the particular application; 
in our experiments it was in the range $10^{-6}$ to $10^{-12}$, with 
the algorithm terminating after 10 to 40 iterations. 
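
The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rating_through_voting`, the nested-dictionary vote format and the small data set are all hypothetical. The sketch starts from vote counts normalized to unit 2-norm on each list, computes each voter's trustworthiness as the sum of the ranks of the items she voted for, and then re-weights every vote by that trustworthiness raised to the power $\alpha$.

```python
import math


def rating_through_voting(votes, alpha=2.0, eps=1e-9, max_iter=1000):
    """Sketch of the iterative voting scheme.

    votes: dict voter -> dict list_id -> chosen item (hypothetical format).
    Returns (ranks, trust, iterations); ranks maps list_id -> item -> rank.
    Items that received no vote are simply absent (their rank is 0).
    """
    # supporters[list_id][item] = voters who chose that item on that list
    supporters = {}
    for voter, ballot in votes.items():
        for lst, item in ballot.items():
            supporters.setdefault(lst, {}).setdefault(item, []).append(voter)

    def normalise(raw):
        # Scale the ranks of one list to unit 2-norm.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        return {item: v / norm for item, v in raw.items()}

    # Initial ranks: plain vote counts, normalised per list.
    ranks = {lst: normalise({it: float(len(vs)) for it, vs in items.items()})
             for lst, items in supporters.items()}

    trust, iteration = {}, 0
    for iteration in range(1, max_iter + 1):
        # Trustworthiness: sum of current ranks of the items a voter chose.
        trust = {voter: sum(ranks[lst][item] for lst, item in ballot.items())
                 for voter, ballot in votes.items()}
        # Each vote is now worth the voter's trustworthiness to the power alpha.
        new_ranks = {lst: normalise({it: sum(trust[v] ** alpha for v in vs)
                                     for it, vs in items.items()})
                     for lst, items in supporters.items()}
        change = math.sqrt(sum((new_ranks[l][i] - ranks[l][i]) ** 2
                               for l in ranks for i in ranks[l]))
        ranks = new_ranks
        if change < eps:
            break
    return ranks, trust, iteration


# Hypothetical data: two voters agree on item "A" on every list, while three
# other voters scatter their votes on lists L1..L4 and jointly push "Z" on L5.
votes = {}
for voter in ("h1", "h2"):
    votes[voter] = {f"L{k}": "A" for k in range(1, 6)}
for voter, item in (("c1", "B"), ("c2", "C"), ("c3", "D")):
    ballot = {f"L{k}": item for k in range(1, 5)}
    ballot["L5"] = "Z"
    votes[voter] = ballot

ranks, trust, iterations = rating_through_voting(votes)
print(ranks["L5"], iterations)
```

On this toy input a plain majority count on list `L5` favours "Z" (three votes against two), whereas the iterated ranks favour "A", because the two voters who back "A" agree with each other on every other list and so accumulate higher trustworthiness.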

The value of the parameter $\alpha$ determines the robustness 
of our algorithm against unfair voters. Clearly, higher values 
of $\alpha$ increasingly favor voters with a high compliance with 
the prevailing community sentiment and penalize more harshly 
votes given to less favoured candidates. While this makes 
our system more robust, large values of $\alpha$ increasingly 
marginalize a significant fraction of honest, but less 
successful voters. In our experiments the values $\alpha < 1.5$ were 
insufficient to obtain satisfactory robustness; the values 
$1.5 \le \alpha \le 3$ gave excellent performance without marginalizing a 
large number of voters, with the value $\alpha = 2$ chosen for our 
implementation of MovieTrust (see Section 4.2). 

Our algorithm will eventually terminate with any strictly 
increasing, twice continuously differentiable function $f(x)$ in 
place of $x^\alpha$, but we saw little value in such choices. As it 



2. the $(\alpha+1)$-norm $\|T\|_{\alpha+1}$ of the trustworthiness vector 
$T(\rho)$ is maximized. 

The reason for considering $\|T(\rho)\|_{\alpha+1}$ rather than $\|T(\rho)\|_{\alpha}$ 
in the second condition will be clear below. Note that, in a 
sense, this gives "the benefit of the doubt" to all voters, giving 
them the largest possible joint trustworthiness rank (in 
terms of the $(\alpha+1)$-norm of the trustworthiness vector), while 
maintaining for each list the same unit 2-norm of the vector 
corresponding to ranks of all objects on that list. Thus, if 
we now define 

$$F(\rho) = \big(\|T\|_{\alpha+1}\big)^{\alpha+1} = \sum_r T_r^{\alpha+1}(\rho) \tag{4}$$

then our aim is equivalent to maximizing $F(\rho)$, subject to 
the constraints 

$$\sum_{i=1}^{n_l} \rho_{li}^2 = 1, \qquad 1 \le l \le L.$$



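For this purpose we introduce for each list $\Lambda_l$ a Lagrangian 
multiplier $\lambda_l$, define $\lambda = (\lambda_l : 1 \le l \le L)$ and look for the 
stationary points of the Lagrangian function 

$$\Phi(\rho, \lambda) = F(\rho) - \sum_{q=1}^{L} \lambda_q \Big(-1 + \sum_{j=1}^{n_q} \rho_{qj}^2\Big).$$
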

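Let $l, m$ be list indices and $i, j$ item indices; then, using 
the fact that the trustworthiness functions $T_r(\rho)$ are linear, 
we obtain 

$$\frac{\partial F(\rho)}{\partial \rho_{li}} = (\alpha+1) \sum_{r \,:\, r \to li} T_r^{\alpha}(\rho); \tag{5}$$

$$\frac{\partial \Phi(\rho,\lambda)}{\partial \rho_{li}} = (\alpha+1) \sum_{r \,:\, r \to li} T_r^{\alpha}(\rho) - 2\lambda_l\,\rho_{li}; \tag{6}$$

$$\frac{\partial^2 F(\rho)}{\partial \rho_{li}\,\partial \rho_{mj}} = \alpha(\alpha+1) \sum_{r \,:\, r \to li,\ r \to mj} T_r^{\alpha-1}(\rho). \tag{7}$$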
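
As a sanity check on the first-order formula, namely that the partial derivative of $F(\rho) = \sum_r T_r^{\alpha+1}(\rho)$ with respect to $\rho_{li}$ equals $(\alpha+1)\sum_{r : r \to li} T_r^{\alpha}(\rho)$, the following sketch compares the analytic gradient with a central finite difference. The vote layout, the rank values and all function names are hypothetical and chosen only for illustration.

```python
def trust(votes, rho):
    # T_r(rho): sum of the ranks of the items voter r has chosen.
    return {r: sum(rho[(l, i)] for l, i in ballot.items())
            for r, ballot in votes.items()}


def F(votes, rho, alpha):
    # F(rho) = sum over voters of T_r(rho) ** (alpha + 1).
    return sum(t ** (alpha + 1) for t in trust(votes, rho).values())


def analytic_grad(votes, rho, alpha):
    # dF/d rho_li = (alpha + 1) * sum of T_r ** alpha over voters who chose (l, i).
    t = trust(votes, rho)
    grad = {key: 0.0 for key in rho}
    for r, ballot in votes.items():
        for l, i in ballot.items():
            grad[(l, i)] += (alpha + 1) * t[r] ** alpha
    return grad


def numeric_grad(votes, rho, alpha, step=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = {}
    for key in rho:
        up, down = dict(rho), dict(rho)
        up[key] += step
        down[key] -= step
        grad[key] = (F(votes, up, alpha) - F(votes, down, alpha)) / (2 * step)
    return grad


# Hypothetical example: three voters, two lists.
votes = {"v1": {"L1": "a", "L2": "x"},
         "v2": {"L1": "a", "L2": "y"},
         "v3": {"L1": "b"}}
rho = {("L1", "a"): 0.8, ("L1", "b"): 0.6,
       ("L2", "x"): 0.6, ("L2", "y"): 0.8}

exact = analytic_grad(votes, rho, alpha=2.0)
approx = numeric_grad(votes, rho, alpha=2.0)
max_error = max(abs(exact[k] - approx[k]) for k in rho)
print(max_error)
```

The two gradients should agree up to the discretization error of the finite difference.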
[Figure 2: A geometric representation of the iterative procedure. (a) Case when h is sufficiently small; (b) additional initial line search.]



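Note that, since every voter votes for at most one item on 
each list, $\frac{\partial^2 F(\rho)}{\partial \rho_{li}\,\partial \rho_{lj}} = 0$ for every $i, j$ such that $i \ne j$. Also, 
if $(\rho, \lambda)$ is a stationary point of $\Phi$ then $\frac{\partial \Phi(\rho,\lambda)}{\partial \rho_{li}} = 0$ 
and thus 

$$\rho_{li}\,\lambda_l = \frac{\alpha+1}{2} \sum_{r \,:\, r \to li} T_r^{\alpha}(\rho). \tag{8}$$

This yields 

$$\rho_{li}^2\,\lambda_l^2 = \frac{(\alpha+1)^2}{4} \Big(\sum_{r \,:\, r \to li} T_r^{\alpha}(\rho)\Big)^2,$$

and by summing the above equations for all indices $i$ of 
objects on the list $l$ we get 

$$\lambda_l^2 \sum_{i=1}^{n_l} \rho_{li}^2 = \frac{(\alpha+1)^2}{4} \sum_{m=1}^{n_l} \Big(\sum_{r \,:\, r \to lm} T_r^{\alpha}(\rho)\Big)^2.$$

Since $(\rho, \lambda)$ is a stationary point of $\Phi$, also $\frac{\partial \Phi(\rho,\lambda)}{\partial \lambda_l} = 0$; this 
implies $\sum_{m=1}^{n_l} \rho_{lm}^2 = 1$, and since by (8) $\lambda_l$ must be positive, 
we obtain from the above and from (8) 

$$\rho_{li} = \frac{\sum_{r \,:\, r \to li} T_r^{\alpha}(\rho)}{\sqrt{\sum_{m} \big(\sum_{r \,:\, r \to lm} T_r^{\alpha}(\rho)\big)^2}}. \tag{9}$$
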



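We now define $\rho \mapsto (\rho)^*$ to be the mapping such that for 
an arbitrary $\rho$, 

$$((\rho)^*)_{li} = \frac{\sum_{r \,:\, r \to li} T_r^{\alpha}(\rho)}{\sqrt{\sum_{m} \big(\sum_{r \,:\, r \to lm} T_r^{\alpha}(\rho)\big)^2}}. \tag{10}$$
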

Recall that we denote by $(x)_l$ the projection of a vector 
$x \in \mathbb{R}^M$ to the subspace of dimension $n_l$ which corresponds 
to a list $\Lambda_l$. Thus, using (5), equations (10) can be written 
as 

$$((\rho)^*)_l = \frac{(\nabla F(\rho))_l}{\|(\nabla F(\rho))_l\|_2}. \tag{11}$$




Consequently, $(\sigma, \lambda)$ is a stationary point of $\Phi$ just in case 
$\sigma^* = \sigma$, i.e., for all $1 \le l \le L$, 

$$(\sigma)_l = \frac{(\nabla F(\sigma))_l}{\|(\nabla F(\sigma))_l\|_2}. \tag{12}$$




Note also that in our algorithm the approximation $\rho^{(n+1)}$ 
of the vector $\rho$ obtained at the stage of iteration $(n+1)$ can 
be written as 

$$\rho^{(n+1)} = (\rho^{(n)})^*$$

and that our algorithm will halt when $\rho^{(n)}$ gets sufficiently 
close to a fixed point $\sigma = (\sigma)^*$ of the mapping $\rho \mapsto (\rho)^*$, a 
stationary point of the Lagrangian function $\Phi(\rho, \lambda)$. As we 
will see, such a point $\sigma$ is a constrained local maximum of 
$F(\rho)$, subject to the constraints $\|(\rho)_l\|_2 = 1$, $1 \le l \le L$. Note 
that we do not need to prove the uniqueness of such a fixed 
point because our final ranks are defined as the outputs of 
our algorithm, and we only need to prove that our algorithm 
eventually terminates; for this purpose just approaching a 
fixed point is sufficient. 

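Let $\rho$ be an arbitrary vector such that $\|(\rho)_l\|_2 = 1$ for all 
$1 \le l \le L$; we abbreviate $(\rho)^*$ with $\rho^*$ and let $h = \rho^* - \rho$. 
By applying the Taylor formula with the remainder in the 
Lagrange form, we get that for some $0 < c < 1$ and $\rho_c = 
c\rho + (1 - c)\rho^*$ we have 

$$F(\rho^*) = F(\rho + h) = F(\rho) + \nabla F(\rho) \cdot h + \frac{1}{2} \sum_{l,i,m,j} \frac{\partial^2 F(\rho_c)}{\partial \rho_{li}\,\partial \rho_{mj}}\, h_{li}\, h_{mj}. \tag{13}$$
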

Since 

$$(h)_l = (\rho^*)_l - (\rho)_l, \tag{14}$$


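using also (11), we get 

$$(\nabla F(\rho))_l \cdot (h)_l = (\nabla F(\rho))_l \cdot \Big(\frac{(\nabla F(\rho))_l}{\|(\nabla F(\rho))_l\|_2} - (\rho)_l\Big) = \|(\nabla F(\rho))_l\|_2 - (\nabla F(\rho))_l \cdot (\rho)_l = \|(\nabla F(\rho))_l\|_2 - \|(\nabla F(\rho))_l\|_2\,(\rho^*)_l \cdot (\rho)_l.$$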
Let $\theta_l$ be the angle between the unit vectors $(\rho)_l$ and $(\rho^*)_l$, 
i.e., such that $\cos\theta_l = (\rho)_l \cdot (\rho^*)_l$. Then (see Figure 2(a)) 

$$\frac{\|(h)_l\|_2^2}{2} = 2\sin^2\frac{\theta_l}{2} = 1 - \cos\theta_l = 1 - (\rho)_l \cdot (\rho^*)_l.$$

Combining the last two formulas we get 

$$(\nabla F(\rho))_l \cdot (h)_l = \frac{\|(\nabla F(\rho))_l\|_2\,\|(h)_l\|_2^2}{2}. \tag{15}$$
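
The identity just derived, that for a unit vector $(\rho)_l$ and $(\rho^*)_l$ equal to the normalized gradient block the inner product $(\nabla F(\rho))_l \cdot (h)_l$ equals $\|(\nabla F(\rho))_l\|_2 \|(h)_l\|_2^2 / 2$, is easy to check numerically. The sketch below does so for one hand-picked pair of vectors; the helper names and the input values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def check_identity(grad_block, rho_block):
    """Return both sides of g.h = |g| * |h|^2 / 2 for one list.

    grad_block plays the role of the gradient restricted to one list;
    rho_block must be a unit vector (current ranks on that list).
    """
    g_norm = norm(grad_block)
    rho_star = [g / g_norm for g in grad_block]       # normalised gradient
    h = [s - r for s, r in zip(rho_star, rho_block)]  # step h = rho* - rho
    lhs = dot(grad_block, h)
    rhs = g_norm * norm(h) ** 2 / 2
    return lhs, rhs


lhs, rhs = check_identity([3.0, 4.0], [1.0, 0.0])
print(lhs, rhs)
```

Here the normalized gradient is (0.6, 0.8), so the step is (-0.4, 0.8) and both sides evaluate to 2 up to rounding.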



Assume first that $\|h\|_2$ is sufficiently small, so that the 
contribution of the second order terms in (13) is small compared 
to the first order term, and, consequently 

$$F(\rho^*) \approx F(\rho) + \nabla F(\rho) \cdot h. \tag{16}$$




Since $\|\nabla F(\rho)\|_2$ is a continuous function, it achieves its 
minimum on the compact set defined by our constraints, i.e., on 
the set $C = \{\rho : \|(\rho)_l\|_2 = 1,\ 1 \le l \le L\}$. It is easy to 
see that the directional derivative of $F(\rho)$ in the (radial) 
direction of vector $\rho$ is always strictly positive; thus, on the 
compact set defined by our constraints its minimum must 
also be strictly positive. Thus, there exists $\kappa > 0$ such that 
$\|\nabla F(\rho)\|_2 > \kappa$ for all $\rho \in C$; using this and by summing 
equations (15) for all $1 \le l \le L$, we get 

$$\nabla F(\rho) \cdot h \ge \frac{\kappa\,\|h\|_2^2}{2}. \tag{17}$$




This and (16) imply that, for $\rho^{(n)}$ and $h^{(n)} = \rho^{(n+1)} - \rho^{(n)}$ 
obtained in our iterations, 

$$F(\rho^{(n+1)}) - F(\rho^{(n)}) = F((\rho^{(n)})^*) - F(\rho^{(n)}) \ge \frac{\kappa\,\|h^{(n)}\|_2^2}{2}.$$

Consequently, since $F(\rho)$ must be bounded on the compact 
set $C$ defined by the constraints, we get that $\|h^{(n)}\|_2^2$ must 
converge to zero, i.e., $\|\rho^{(n)} - (\rho^{(n)})^*\|_2$ will eventually be 
smaller than the prescribed threshold and the algorithm will 
terminate. 

If $\|h\|_2$ is not sufficiently small, so that the impact of the 
second order term in (13) makes the inequality (17) false, 
we supplement our algorithm with an initial phase which 
involves a line search. While this ensures a provable 
convergence of our algorithm, in all of our (very numerous) 
experiments such a line search was never activated; however, we 
were unable to prove without any additional assumptions 
that indeed such line search is superfluous, so we present a 
slight modification of our algorithm. Let 

$$f(\rho, t) = F(\rho + t(\rho^* - \rho));$$

then, by the previous considerations, for sufficiently small 
$t$ the function $f(\rho^{(n)}, t)$ is increasing in $t$. We now modify our 
iteration step as follows. If there exists $t_0 \in (0, 1)$ such that 
$\frac{\partial f}{\partial t}(\rho^{(n)}, t_0) = 0$ (testing this amounts to solving a low degree 
algebraic equation), then we let $\rho^{(n+1)} = \rho'$, where $\rho'$ is 
defined so that for all $1 \le l \le L$, and for the smallest root 
$t_0$ of the above equation, 

$$(\rho')_l = \frac{(\rho^{(n)})_l + t_0\big(((\rho^{(n)})^*)_l - (\rho^{(n)})_l\big)}{\big\|(\rho^{(n)})_l + t_0\big(((\rho^{(n)})^*)_l - (\rho^{(n)})_l\big)\big\|_2},$$

see Figure 2(b); if no such $t_0$ exists, we let $\rho^{(n+1)} = (\rho^{(n)})^*$. 

The convergence now follows from an argument similar to 
the one in the previous case. 

3.4 Rating Through Voting Procedure 

We now describe how we use the proposed voting algo- 
rithm in online rating systems. 

Our rating procedure starts by assigning to each product 
$\pi_l$, $1 \le l \le L$, a voting list $\Lambda_l$; the items on each voting 
list are the rating levels, comprising a "scale" from 1 to $n$ 
(in practice $n \le 10$). Each rater is now construed as a 
voter $V_r$ ($1 \le r \le N$). We then process cast evaluations 
by interpreting an evaluation of level $i$ ($1 \le i \le n$) of a 
product $\pi_l$ ($1 \le l \le L$) by a voter $V_r$ as his vote for item $i$ 
on the list $\Lambda_l$. After our iterative algorithm has terminated, 
each level $1 \le i \le n$ on each voting list $\Lambda_l$ has received a 
corresponding credibility degree $\rho_{li}$. 

We can now obtain a rating score of each product 
$\pi_l$, using such credibility degrees $(\rho)_l = (\rho_{l1}, \ldots, \rho_{ln})$ in a way 
which suits the particular application best. For example, if 
the rating scores have to reflect where the community 
sentiment is centered, we can simply choose as the rating score 
$R(\pi_l)$ of $\pi_l$ the rating level which has the highest credibility 
rank. Such a rating score does not involve any averaging 
and is most indicative of the community's prevailing 
sentiment. On the other hand, if we wish to obtain a score which 
emphasises such prevailing sentiment but, to a varying 
degree, takes into account "dissenting views", one can form a 
weighted average of the form 

$$R(\pi_l) = \frac{\sum_{1 \le i \le n} i \cdot \rho_{li}^{\,p}}{\sum_{1 \le i \le n} \rho_{li}^{\,p}},$$

where $p \ge 1$ is a parameter. As $p$ increases, such rank 
converges to the previous, "maximum credibility" rank, while 
for smaller values of $p$ we obtain a significant averaging 
effect. In our implementations and testing of MovieTrust 
we have used $p = 2$. 
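
The two aggregation options can be illustrated with a short sketch. It assumes that the weighted average takes the power-weighted form $R = \sum_i i\,\rho_i^p / \sum_i \rho_i^p$ over the levels $1, \ldots, n$, which is consistent with the limiting behaviour described above but is an assumption of this example; the function names and the credibility values are hypothetical.

```python
def max_credibility_level(cred):
    """Rating level (1-based) with the highest credibility degree."""
    return max(range(1, len(cred) + 1), key=lambda i: cred[i - 1])


def weighted_score(cred, p=2.0):
    """Power-weighted average of the levels 1..n (assumed form).

    cred[i - 1] is the credibility degree of level i; larger p puts more
    weight on the most credible level.
    """
    weights = [c ** p for c in cred]
    return sum(i * w for i, w in enumerate(weights, start=1)) / sum(weights)


# Hypothetical credibility degrees for a five-level scale.
cred = [0.1, 0.2, 0.9, 0.3, 0.2]
top_level = max_credibility_level(cred)
score_p2 = weighted_score(cred, p=2.0)
score_p50 = weighted_score(cred, p=50.0)
print(top_level, score_p2, score_p50)
```

With these values the most credible level is 3; the score for $p = 2$ lies slightly above 3 because levels 4 and 5 carry some weight, and for a large exponent the score is numerically indistinguishable from 3.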



4. IMPLEMENTATION 
4.1 Application Scenario 

One of the most popular rating scenarios on the web is 
movie rating. There are several movie rating systems which 
allow their users to rate movies and post their reviews. 
Based on users' evaluations of the movies, a system can 
calculate a rating score for every movie. Such rating scores are 
usually some form of average of the evaluations posted by 
users. For example, IMDb is one of the best known online 
movie rating systems. It calculates rating scores for movies 
using an algorithm which the web site owners do not 
disclose, wishing, as they declare, to keep it effective, but they 
explicitly say that their ranks are a weighted average [14]. 

Another movie rating and recommending system is 
MovieLens [25], provided by the GroupLens research lab at the 
University of Minnesota [11]. MovieLens uses a collaborative 
filtering method to recommend movies to its users, based 
on their personal preferences. In this paper we test our 
system using a partial copy of MovieLens data obtained from the 
GroupLens website [25]. The dataset contains 855598 
rating scores cast by 2113 users on nearly 10000 movies. For 
every movie this dataset also contains the corresponding 
rating score given by the top critics of the Rotten Tomatoes movie 
[Figure 3: Screen shot showing application of the MovieTrust Chrome Extension]



rating website [30], another well known movie rating system 
which we also use for comparison purposes. 

4.2 MovieTrust 

To test our methodology, we have designed and 
implemented a movie rating system, which we call MovieTrust, 
aiming to robustly calculate rating scores for movies. 
MovieTrust comprises three components. One is the rating 
calculation engine, the backbone of the application, which 
calculates the rating scores based on locally stored data. 
The second component is an API which makes MovieTrust 
services available to users all around the Internet. The third 
component is an extension for the Google Chrome browser. 
At the moment, this extension is developed specifically for 
the IMDb website. 

When the MovieTrust extension is installed on the Chrome 
browser and a user visits the IMDb page of a movie, the 
MovieTrust icon appears in the right corner of the address bar; 
the user can simply click on it to see what rating score 
has been calculated for that movie by our system. Figure 3 
shows a screen shot of the MovieTrust extension on the Chrome 
browser. 

MovieTrust has a website [27] which provides all 
related information about the tool. The extension is available 
for download, to allow easy evaluation of the performance 
of our system and comparison with the well known movie 
rating system, IMDb. 

5. EVALUATION 

We evaluate two aspects of the performance of our method: 
robustness against unfair evaluations and the accuracy of 
calculated rating scores. 

5.1 Evaluating Robustness 

We use three scenarios to evaluate the robustness of our 
method. 

5.1.1 Evaluation Scenarios 

Scenario 1. We start evaluation of our system with a very 
simple example, which nevertheless illustrates the heuristic 
which was the starting point for our algorithm. 
Table 1(a) shows the votes cast by 5 voters ($r_1, \ldots, r_5$) on 5 
items ($I_1, \ldots, I_5$) in 6 different elections ($\Lambda_1, \ldots, \Lambda_6$). For 
example, the first row shows that in the first election $r_1$, $r_2$ 
and $r_3$ have voted for $I_1$ and $r_4$ and $r_5$ have voted for $I_2$. 

In the usual election model, in the last election $I_1$ must 
win because it has received 3 votes out of 5. However, when 
we look at the history of the voters' choices, $r_2$ and $r_3$, who 
have voted for $I_2$ in the last election, have voted for the winners 
in all past elections. This means that they have always been 
behaving close to community consensus. On the other hand, 
$r_1$, $r_4$ and $r_5$, who have voted for $I_1$, do not conform to 
community consensus in most of the past elections. Therefore, 
we can argue that in the last election $I_2$ should win, 
despite the fact that the majority have voted for $I_1$. 
Table 1(b) shows the results of running our model on the data 
provided in Table 1(a). During initialisation of our 
algorithm, item $I_1$ receives an initial rank of 
$\frac{3}{\sqrt{3^2 + 2^2}} \approx 0.83$ while $I_2$ receives an initial rank of 
$\frac{2}{\sqrt{3^2 + 2^2}} \approx 0.55$. However, the 
final ranks obtained after 16 iterations, shown in Table 1(b), 
are 0.2 and 0.98 respectively; thus, $I_2$ wins the last election 
despite the majority voting for $I_1$. 
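
The initial ranks quoted above are plain vote counts scaled to unit 2-norm. A minimal sketch of that arithmetic, with a hypothetical helper name `initial_ranks`, is:

```python
import math


def initial_ranks(vote_counts):
    """Scale raw vote counts so that the resulting vector has unit 2-norm."""
    norm = math.sqrt(sum(c * c for c in vote_counts))
    return [c / norm for c in vote_counts]


# Last election of Scenario 1: 3 votes for one item, 2 for the other.
first, second = initial_ranks([3, 2])
print(round(first, 2), round(second, 2))
```

For the counts 3 and 2 this gives 3/sqrt(13) and 2/sqrt(13), which round to 0.83 and 0.55.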



Scenario 2. In the second scenario, we use synthetic data 
to show how robust our system is against unfair evaluations 
such as collusion. Our new dataset contains 60 voters 
voting in 7 elections, with each election list having 8 
items ($I_1, \ldots, I_8$). The first 15 voters are "honest voters" who 
do not cast unfair evaluations. We generate these users and 
their corresponding votes by replicating voters $r_1$ to $r_5$ three 
times to generate 15 honest voters ($r_1, \ldots, r_{15}$). We also 
generate 45 voters who make random choices in all elections 
except the last one. In the last, 7th election, the unfair voters 
collude in order to manipulate the outcome of the election, 
trying to secure election of item $I_8$. On the other hand, 
all 15 honest voters vote for $I_1$. 

Thus, we have 6 elections in each of which 1/4 of all voters
cast their evaluations according to the pattern in Scenario 1,
while 3/4 of all voters cast random votes, and one election in
which the collusion attack happens. Note that the number
of "unfair" voters is 3 times the number of fair voters. In a
normal election scenario, it is obvious that item $I_8$ would
win the last election, having received 3 times the number
of votes of its opponent $I_1$. In our method, during the
initialisation process, item $I_1$, having obtained 15 votes,
gets the initial ranking score
$15/\sqrt{15^2+45^2} \approx 0.32$, and item $I_8$, having
obtained 45 votes, gets an initial score of
$45/\sqrt{15^2+45^2} \approx 0.95$.
However, after we run our algorithm (which terminates after
16 iterations) the scores are essentially reversed and $I_1$
wins the election, having obtained credibility degree
$\approx 0.96$ versus the credibility degree $\approx 0.28$
obtained by $I_8$. Such an outcome is to be expected from a
robust voting system. Table 2 shows the ranks produced by our
algorithm and, for comparison, in the last column, the initial
ranks based on (normalized) simple vote counting.
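The unfair part of the Scenario 2 dataset is simple enough to sketch in code. The snippet below is an illustration under our own assumptions, not the generator used in the experiments: the seed, the helper name `unfair_votes` and the encoding of items as integers 1 to 8 are hypothetical, and the history of the honest voters (the pattern of Table 1(a)) is omitted because only their final vote matters for the tally shown here.

```python
import random
from collections import Counter
from math import sqrt

N_HONEST, N_UNFAIR, N_ELECTIONS, N_ITEMS = 15, 45, 7, 8

def unfair_votes(rng):
    """Votes of one unfair voter: random in elections 1-6, item 8 in election 7."""
    votes = [rng.randint(1, N_ITEMS) for _ in range(N_ELECTIONS - 1)]
    votes.append(8)  # collusion target I8 in the last election
    return votes

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
unfair = [unfair_votes(rng) for _ in range(N_UNFAIR)]

# Tally of the last election: 15 honest votes for I1, 45 colluding votes for I8.
tally = Counter({1: N_HONEST})
tally.update(v[-1] for v in unfair)

# Initial scores: vote counts divided by their Euclidean norm.
norm = sqrt(sum(c * c for c in tally.values()))
initial = {item: c / norm for item, c in tally.items()}
print(tally.most_common(1)[0][0])                  # 8: simple counting elects I8
print(round(initial[1], 2), round(initial[8], 2))  # 0.32 0.95
```

The tally shows why any method that only counts votes in the last election hands the result to the colluders; the reversal reported above depends on the voters' behaviour in the earlier six elections.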



Scenario 3. In this scenario, we use real world evaluation
results from the MovieLens dataset. We inject different
levels of collusion attacks into the dataset and check the
robustness of our method. We also compare the performance of
our method with two commonly used rating models: Averaging
and Majority. We are aware that there are many recent
methods, such as [28] or [39], which try to calculate rating
scores after detecting and eliminating unfair evaluations.
However, it should be noted that in the presence of the
massive collusion attacks which we inject into the dataset,
control of the products would inevitably be given to the
colluders. All existing collusion detection systems use
majority either as a primary indicator, as in [17], or as a
secondary indicator to tune their primary indicators, as in
[28] or [39]. Therefore, they must fail when the number of
unfair votes is far larger than the number of fair ones. This
explains our choice of methods used for comparison with our
method.

Table 1: A simple example for showing impact of voter trustworthiness on election result (Scenario 1).

Table 2: The Results of running our model in the presence of a large number of colluders (Scenario 2).

To evaluate the methods, we calculate a rating score for each
movie in the dataset using our method, Averaging and
Majority. We then add collusion attacks to the dataset,
recalculate the rating scores of the movies, and find the
difference between the new ranks and the old ranks. To
quantify these differences for each method in each run, we
calculate the root mean square (RMS) of the differences
between the new ranks and the old ranks over all movies.
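As a concrete reading of this metric, the following sketch computes the RMS of the score changes over all movies. The function name `rms_difference` and the three example scores are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
from math import sqrt

def rms_difference(old_scores, new_scores):
    """Root mean square of (new - old) over the movies present in both mappings."""
    common = old_scores.keys() & new_scores.keys()
    if not common:
        raise ValueError("no movie appears in both score sets")
    squared = sum((new_scores[m] - old_scores[m]) ** 2 for m in common)
    return sqrt(squared / len(common))

# Hypothetical scores before and after an injected attack.
before = {"m1": 2.0, "m2": 8.5, "m3": 6.0}
after = {"m1": 5.0, "m2": 8.5, "m3": 2.0}
print(round(rms_difference(before, after), 3))  # 2.887, i.e. sqrt((9 + 0 + 16) / 3)
```

An RMS of 0 means the attack did not move any score; larger values mean the method was pushed further from its attack-free output.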

To demonstrate the robustness of our system, we try to
promote all movies which have very low rating scores, i.e.,
movies to which the majority of users have given a rating
score lower than 3, by posting evaluations with the value 10
to sharply increase their rating scores. We also try to demote
all movies having a very high rating score, i.e., movies with
a rating score higher than 8 from the majority of raters, by
injecting evaluations with the value 1 to sharply decrease the
rating scores of such movies. For each movie to be promoted
or demoted we inject different levels of collusion attacks,
ranging from 0% to 200%. We then compare the RMS values
of the differences to show how these three methods behave
in the presence of such attacks.

Suppose that for a movie $\Lambda_i$, $m_i$ evaluations have
been posted by real world users. To test our method with a
collusion attack of size 200% of the number of the real,
already existing votes, we inject $2 \times m_i$ votes which
try to promote or demote the product, as explained above.
This clearly creates a massive collusion attack on the
product (here a movie) which sharply shifts the majority
consensus toward the unfair evaluations, thus making every
existing method likely to fail.
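The injection procedure and the two baseline models can be sketched as follows. This is an illustration under stated assumptions: we take Averaging to be the arithmetic mean of the votes and Majority to be the most frequently chosen rating level (ties broken towards the lower level, a choice of ours); the function names and the five example votes are hypothetical.

```python
from collections import Counter
from statistics import mean

def inject_collusion(real_votes, level, unfair_value):
    """Append round(level * m_i) unfair votes to the m_i real votes of one movie."""
    n_injected = round(level * len(real_votes))
    return list(real_votes) + [unfair_value] * n_injected

def averaging_score(votes):
    """Averaging baseline: arithmetic mean of all votes."""
    return mean(votes)

def majority_score(votes):
    """Majority baseline: most frequent rating level (lowest level wins a tie)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(value for value, c in counts.items() if c == top)

real = [2, 2, 3, 1, 2]  # a low-rated movie with m_i = 5 real votes
attacked = inject_collusion(real, level=2.0, unfair_value=10)  # 200% promotion
print(len(attacked) - len(real))                       # 10 injected votes
print(majority_score(real), majority_score(attacked))  # 2 10
print(round(float(averaging_score(real)), 2),
      round(float(averaging_score(attacked)), 2))      # 2.0 7.33
```

Even in this tiny example the Majority score jumps straight to the injected value, while the Averaging score is dragged most of the way towards it.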



Figure 4(a) shows the results of injecting collusion attacks
in order to promote low-ranked movies. As shown in the chart,
the Majority model is highly vulnerable to unfair votes; its
corresponding evaluation level, and thus the corresponding
RMS error, increases sharply to the highest possible value
and remains steady at that level. The Averaging model
performs better, but its RMS error nevertheless increases
steadily as more unfair evaluations are added. The RMS error
of our method is extremely small, increasing only slightly as
the number of unfair evaluations approaches 200%, and never
getting close to serious malfunctioning, in sharp contrast
with the other two methods.



Likewise, Figure 4(b) shows that our model also behaves well
in the presence of a massive demoting collusion. While the
RMS error for our method is around 1, the Majority model
jumps immediately to the highest possible level and the
Averaging model moves steadily towards the unfair evaluations.

The reason why the RMS error of our method is larger
in the presence of demoting attacks is that for high-ranked
movies the total number of existing votes is much larger than
the number of votes for the low-ranked movies targeted in
promoting attacks. Consequently, the number of injected
collusive votes in the case of demoting attacks is much
larger; for some movies the number of collusive votes
exceeded 5000, which accounts for the higher overall impact
on the system's RMS value. However, even with such extreme
collusion attacks, our system is remarkably robust.

Figure 4: Comparison of behavior of three methods in the presence of different levels of injected collusion attacks (Scenario 3)

5.1.2 Discussion

We tested our system by generating groups of voters with
various types of voting patterns. Our experiments show that
as the number of elections increases, so does the robustness
of the system, as expected. The reason is that the larger
the number of elections, the harder it is to manipulate our
"globally obtained" trustworthiness ranks, in which the rank
of any voter depends on the ranks of all other voters. Also,
as the number of items on the lists increases, the
probability that several voters cast the same random votes
decreases, and consequently the system performs more
robustly. Thus, in order to skew a particular election,
colluding voters need either a long history of "honest
looking" activity, or a long history of massive collusion
attacks that have gone undetected by other, more standard
methods of collusion detection (which can be used in
conjunction with our method), or they must actually manage
to change the sentiment of the community.

To summarize, the following are the possible scenarios
in which colluders might be able to manipulate the result of
an election:

• Colluders build up a large team of voters. All these
voters must vote over a rather long period of time in
accordance with the community sentiment in order to earn
high trustworthiness ranks, and then attempt to manipulate
a particular election. This team must be large enough to
dominate the others while trying to promote or demote a
product. This is unlikely to happen, because research
[28, 39] shows that collusion happens over a short period
of time and only in a small number of elections. In online
rating systems, with millions of members and products, it
is quite hard to collect a large fraction of members and
encourage them to rate most of the existing products in
the system honestly for a long period of time, and then ask
them to rate a particular product as the collusion team
intends.

• The second possibility is to build a very large group
of colluders and manipulate all elections in the system to
change the sentiment of the community toward the intention
of the collusion team. In this case, since the community
sentiment is now that of the collusion party, the honest
voters will be marginalized and eventually their impact
will be eliminated. This might be possible for small
communities with a small number of members and items.
However, it is practically impossible to carry out such an
attack in a huge online rating system like Amazon or IMDb,
or for such behavior to go undetected by other, relatively
simple collusion detection methods, which easily detect
such long stretches of extreme behavior and which can be
deployed in parallel, with minimal additional complexity
and cost to the system.

5.2 Evaluating Accuracy of Results 

5.2.1 Evaluation Scenario 

To evaluate our model we need some 'gold standards' or
ground truth levels with which to compare our calculated
rating scores, so that we can check how they deviate from
such gold standards. Product rating is usually a subjective
task, i.e., the result of evaluating a product is strongly
related to the personal interests and experience of the
evaluator. This subjectivity is even more pronounced when a
movie is evaluated. It is impossible to evaluate a movie
automatically, find its real quality level and compare it
with our scores. On the other hand, people have different
tastes and interests, and the rating scores they give to a
movie may sometimes conflict. So, it is not possible to ask
a few people to watch a movie and provide us with a reliable
rating score for the movie as a ground truth level.

To solve this problem, we rely on rating scores calculated 
by two well-known existing movie rating systems. The first 
one is IMDb. IMDb calculates a rating score for every
movie using many parameters such as the age of the voter, 
the value of the posted rating score, the distribution of the 
votes, etc. IMDb does not disclose its method of calculating 
ratings, but the success of the website in the long run is a 
good indication that the calculated rating scores are realistic 
and reliable. We extracted the rating score of every movie 
from IMDb. 

The second gold standard which we use is the rating scores
given to movies by top critics on the Rotten Tomatoes movie
ranking website (referred to as RTCritics in the following).
The critics are better informed and more experienced than an
average user. Although they might have their own personal
interests, the community of critics, as domain experts, can
provide dependable ratings of movies. We extracted the top
critics' movie rating scores from the MovieLens dataset, as
we have explained in the application scenario (see
Section 4.1).

First, we compare the rating scores calculated by the three
models. We have more than 9000 movies in our database. We
have RTCritics ranks only for about 4000 movies, but this is
still too large to fit into a readable chart. So, for every
100 movies we calculate the average score (mean). Note that
this is done only to make a graphical representation of the
rating scores of more than 4000 movies feasible; the RMS
values of the changes due to collusion attacks calculated in
the previous subsection are obtained from the entire dataset,
without any averaging. We use these averages to compare our
rating scores with those provided by IMDb and Rotten
Tomatoes. Figure 5 shows the result of this comparison. In
general, our scores are higher than those of RTCritics and
very close to the scores obtained from IMDb. The "harsher"
ratings from Rotten Tomatoes are to be expected from a
community described aptly by the label 'the critics'.
Moreover, not only do our ratings conform to the ratings
from IMDb, but also when our ratings do differ, the
difference is nearly always towards the rating scores
provided by RTCritics.

Figure 5: Average rating scores calculated for movies using three models: MovieTrust, IMDb and RTCritics
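The grouping used for the chart can be sketched as below. The text does not say in which order the movies are grouped, so this sketch simply assumes the scores arrive already in the order used for plotting; the function name `block_means` is ours and the example values are hypothetical.

```python
from statistics import mean

def block_means(scores, block_size=100):
    """Average consecutive blocks of `block_size` scores (last block may be shorter)."""
    return [mean(scores[i:i + block_size])
            for i in range(0, len(scores), block_size)]

# Tiny illustration with blocks of 2 instead of 100.
print(block_means([7.0, 8.0, 6.0, 5.0, 9.0], block_size=2))  # [7.5, 5.5, 9.0]
```

With about 4000 movies and blocks of 100, this yields roughly 40 points per model, which is what makes the three curves readable on a single chart.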

5.2.2 Discussion 

The accuracy evaluation results show that the results
obtained by our system are very close to the rating scores
calculated by IMDb and RTCritics. The slight differences
between our method and the other two are accounted for as
follows. The three rankings are based on three different
datasets: MovieLens (the dataset we used), the IMDb dataset
and the Rotten Tomatoes dataset. Although the number of
samples in each of these datasets is sufficiently large to
expect the same statistical features, small differences
would still occur, most notably because the rating scores
calculated by our model are based on a dataset which holds
evaluation data only up to 2009.

More important is the high degree of similarity of the plots
of the ranks produced by these three systems (see Figure 5),
which strongly supports the dictum that "the wisdom of the
crowd is reliable" (see [33]), in the sense that finding the
community sentiment on the quality of a product is at least
as dependable as, and in all likelihood more dependable
than, any other kind of metric, such as various averages and
the like. We explain this by the fact that our system
compensates for anomalous behavior of voters as well as for
natural diversity of opinions, allowing detection of the true
sentiment of the community, and does so without external
supervision or an explicit "0-1" classification of voters
according to their perceived reliability.

In all practical systems the evaluations submitted by the
participants are of coarse granularity, involving at most
ten levels. The large number of voters, together with the
relatively small number of 'bins' in which their votes for
each product can be placed, ensures that the real sentiment
of the community will be captured by the emerging set of
voters with high trustworthiness ranks. This ensures high
efficiency of our method, especially for very large online
rating systems, for which trust management is both of the
highest importance and one of the most challenging problems.

6. RELATED WORK 

As demonstrated in [57], as rating systems become more
and more popular and more users rely on them to decide
on purchases from online stores, the temptation to obtain
fake rating scores for products or fake reputation scores for
people has dramatically increased. To detect such reviews,
Mukherjee et al. [25] propose a model for spotting fake
review groups in online rating systems. The model analyzes
textual feedback posted on products in the Amazon online
market to find collusion groups. They employ the FIM
(frequent itemset mining) algorithm to identify candidate
collusion groups and then use 8 indicators to identify
colluders. In [23] the authors assign every review a degree
of spam value, and based on these values they identify the
most suspicious users and investigate their behavior to find
the most likely colluders. In [18] the authors try to
identify fake reviews by looking for unusual patterns in
posted reviews.

In a more general setup, collusion detection has been stud- 
ied in P2P and reputation management systems; good sur- 
veys can be found in [B] and [35]. EigenTrust [TS] is a well 
known algorithm proposed to produce collusion-free
reputation scores; however, the authors of [22] demonstrate
that it is not robust against collusion. Another series of
works [24, 38, 39] uses a set of signals and alarms to point
to suspicious behavior. The most famous of all, the PageRank
algorithm [29] was also devised to prevent collusive groups 
from building fake ranks for pages on the web. 

Iterative methods for trust evaluation and ranking have
been pioneered in [20, 41]. Some of the ideas from these
papers were among the starting points of [7-9], as the
authors mention; the proof techniques which we used in this
paper were inspired by the techniques developed in [7].
However, our present method sharply differs from all of
these prior iterative methods by virtue of entirely
decoupling the credibility assessment from the score
aggregation. More precisely, the main idea used in [20, 41]
and in [7-9] is to produce at each stage of iteration an
approximation of the final ratings of the objects and then
calculate for each rater the degree of her "belief
divergence" from such calculated approximations, i.e.,
calculate some distance measure between her proposed ranks
and these approximations of the final ranks of objects. In
the subsequent round of iteration a new approximation of the
ranks of all objects is obtained as a weighted average of
the ranks proposed by raters, with the weight given to each
rater's rank inversely related to her corresponding distance
from the approximate final ranks obtained in the previous
round of iteration. Thus, in this manner, the ranks of
objects are produced simultaneously with an assessment of
the trustworthiness of the raters as reflected in the
weights given to their proposed ranks.
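The general scheme of these prior methods can be illustrated with a short sketch. It is a generic instance of the idea described above, not the exact algorithm of any cited paper: the mean squared distance, the weight function $1/(\epsilon + d)$, the fixed number of iterations and the example ratings are all assumptions made for illustration, and every rater is assumed to rate every object.

```python
def iterative_filtering(ratings, n_iter=20, eps=1e-6):
    """ratings[r][o] is the rating given by rater r to object o.

    Each round: estimates = weighted average of ratings, then
    weights = 1 / (eps + mean squared distance of the rater from the estimates).
    """
    n_objects = len(ratings[0])
    weights = [1.0] * len(ratings)  # start from the plain average
    for _ in range(n_iter):
        total = sum(weights)
        estimates = [
            sum(w * row[o] for w, row in zip(weights, ratings)) / total
            for o in range(n_objects)
        ]
        distances = [
            sum((row[o] - estimates[o]) ** 2 for o in range(n_objects)) / n_objects
            for row in ratings
        ]
        weights = [1.0 / (eps + d) for d in distances]
    return estimates, weights

# Three raters who roughly agree and one outlier (hypothetical data).
ratings = [[8, 3, 6], [8, 3, 6], [7, 3, 6], [1, 10, 1]]
estimates, weights = iterative_filtering(ratings)
print([round(e, 2) for e in estimates])
print(weights.index(min(weights)))  # 3: the outlier ends up with the lowest weight
```

Note how the estimates of the final ratings and the rater weights feed into each other in every round; this coupling is exactly what the method of the present paper avoids.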

In contrast, our iterative method operates only on credi- 
bility assessment of raters and on the levels of the commu- 
nity approval of items, which are obtained without using 
the fact that the items voted on are rating levels. In fact, 
as is obvious from our algorithm, we never use any
comparisons of the proposed rating levels or even any or- 
dering of the rating levels. We only rely on the levels of 
concurrence of the opinions of raters. Thus, our system can 
subsequently choose how to use such estimates of the 'com- 
munity sentiment' to produce the aggregate rating scores of 
items. 

The second author and his co-authors, unaware of the
pioneering work in [7-9, 20, 41], have proposed in [13] a
fixed-point algorithm for trust evaluation in online
communities and subsequently in [2] an algorithm for
aggregating assignment marks given by multiple assessors.
This method was later applied to aggregation of sensor
readings in wireless sensor networks, in the presence of
sensor faults [5]. He also proposed the idea of applying an
iterative procedure for vote aggregation to his
collaborators in [21]; however, no proof of convergence of
the method was provided there, and, more importantly, the
proposed method had some serious shortcomings. In the
notation of the present paper, denoting again the total
number of voters by $N$ and the total number of voting lists
by $L$, the recursion for computing the trustworthiness
$T_r$ of a rater $V_r$ proposed in [21] was given by

p 

i / V T^"' \ p+1 

Unfortunately, the normalizing factor in the denominator
on the right-hand side can become excessive as the number of
voters who did not vote in any election in which $V_r$ has
voted increases, making the rank computation unstable. Also,
the exponent is always smaller than 1, and this severely
limits the robustness of the proposed method against
collusion attacks. The algorithm aimed to relate (a power
of) the ratios between the trustworthiness of any two voters
to the ratios of the numbers of votes received by the
candidates chosen by these voters. It also normalized the
trustworthiness of voters, instead of normalizing the
credibility of levels. However, normalizing the credibility
of levels, as we do in our present algorithm, makes more
sense, since these are going to be used as weights in a
subsequent (independent) computation of ranks of objects,
and it also allows an elegant proof of convergence, which is
missing in [21].



In summary, unlike the existing models for collusion de- 
tection, we do NOT rely on any clustering techniques, local 
indicators or averaging; also, our method does not rely on 
any approximation of the final rating scores, making rating 
an entirely independent process from the credibility assess- 
ment. 

7. CONCLUSION 

As we have mentioned, existing iterative methods, such
as [7-9, 20, 41], approximate ranks using techniques
involving weighted averages. However, averages generally
have the propensity to blur statistical features because
they smooth out data. In our method, the trustworthiness of
raters is computed purely from the concurrence of opinions,
without any averaging at all. In fact, note that in our
"rating-through-voting" method, the ordering of the range of
credibility levels (i.e., an increasing ordering from, e.g.,
1 to 10) is NOT considered at all: we treat such a domain as
an unordered set, and only consider the concurrence of
opinions. The trustworthiness of raters and the credibility
of items (in this case rating levels) obtained in this way
can then be used to obtain the values of the rating scores
in a completely decoupled way, for example, by taking a
weighted average with weights obtained as some function of
the credibility scores obtained for each rating level, or by
choosing the highest ranked level, or in many other possible
ways, depending on what kind of statistical feature we are
mining the submitted evaluation data for.
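The two aggregation options mentioned above can be sketched as follows. The credibility values in the example are hypothetical and are not produced by the algorithm here; the function names and the default identity weight function are likewise our own choices, standing in for "some function of the credibility scores".

```python
def weighted_average_score(credibility, weight_fn=lambda c: c):
    """Weighted average of rating levels, weights derived from level credibility."""
    weights = {level: weight_fn(c) for level, c in credibility.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(level * w for level, w in weights.items()) / total

def top_level_score(credibility):
    """Pick the rating level with the highest credibility."""
    return max(credibility, key=credibility.get)

# Hypothetical credibility of four rating levels for one movie.
credibility = {1: 0.05, 7: 0.30, 8: 0.90, 9: 0.20}
print(round(weighted_average_score(credibility), 2))  # 7.69
print(top_level_score(credibility))                   # 8
```

Passing a steeper `weight_fn` (for instance squaring the credibility) moves the weighted average closer to the top-ranked level, which is one way such a decoupled aggregation step could be tuned.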

However, in our future work, we will further refine our 
method by taking into account the ordering of the rating 
levels, with the aim to produce a complete yet fully flexible 
rating methodology which can be precisely tuned to pro- 
duce rating scores reflecting any desired statistical feature 
of evaluation data. 